Goto

Collaborating Authors

 nearest neighbour


The Theory and Practice of Highly Scalable Gaussian Process Regression with Nearest Neighbours

Allison, Robert, Maciazek, Tomasz, Stephenson, Anthony

arXiv.org Machine Learning

Gaussian process ($GP$) regression is a widely used non-parametric modeling tool, but its cubic complexity in the training size limits its use on massive data sets. A practical remedy is to predict using only the nearest neighbours of each test point, as in Nearest Neighbour Gaussian Process ($NNGP$) regression for geospatial problems and the related scalable $GPnn$ method for more general machine-learning applications. Despite their strong empirical performance, the large-$n$ theory of $NNGP/GPnn$ remains incomplete. We develop a theoretical framework for $NNGP$ and $GPnn$ regression. Under mild regularity assumptions, we derive almost sure pointwise limits for three key predictive criteria: mean squared error ($MSE$), calibration coefficient ($CAL$), and negative log-likelihood ($NLL$). We then study the $L_2$-risk, prove universal consistency, and show that the risk attains Stone's minimax rate $n^{-2α/(2p+d)}$, where $α$ and $p$ capture regularity of the regression problem. We also prove uniform convergence of $MSE$ over compact hyper-parameter sets and show that its derivatives with respect to lengthscale, kernel scale, and noise variance vanish asymptotically, with explicit rates. This explains the observed robustness of $GPnn$ to hyper-parameter tuning. These results provide a rigorous statistical foundation for $NNGP/GPnn$ as a highly scalable and principled alternative to full $GP$ models.




Supplementary Material for GPEX, A Framework For Interpreting Artificial Neural Networks Amir Akbarnejad, Gilbert Bigras, Nilanjan Ray

Neural Information Processing Systems

Fig. S1: The proposed framework as a probabilistic graphical model. In this section we derive the variational lower-bound introduced in Sec.2.3 of the main article. W e firstly introduce Lemmas 1 and 2 as they appear in our derivations. As illustrated in Fig.S1, the ANN's input In Fig.S1 the lower boxes are the inducing points and other variables that determine the GPs' posterior. S1.1 Deriving the Lower-bound With Respect to the Kernel-mappings In the right-hand-side of Eq.S6 only the following terms are dependant on the kernel-mappings The first term is the expected log-likelihood of a Gaussian distribution (i.e. the conditional log-likelihood of Therefore, we can use Lemma.2 to simplify the first term: E According to Lemma.1 we have that Therefore, the KL-term of Eq.S8 is a constant with respect to the kernel mappings All in all, the lower-bound for optimizing the kernel-mappings is equal to the right-hand-side of Eq.S9 which was introduced and discussed in Sec.2.3. of the main article. S1.2 Deriving the Lower-bound With Respect to the ANN Parameters According to Eq.4 of the main article, in our formulation the ANN's parameters appear as some variational parameters. Therefore, the likelihood of all variables (Eq.S6) does not generally depend on the ANN's parameters. This likelihood turns out to be equivalent to commonly-used losses like the cross-entropy loss or the mean-squared loss. Here we elaborate upon how this happens. This conclusion was introduced and discussed in Eq.6 of the main article. W e can draw similar conclusions when the pipeline is for other tasks like regression, or even a combination of tasks.